Introduction

The goal of this project is to predict whether a savings customer will take out a credit. We have several data sources: savings account transactions, ZIP codes, ATM geographical and transactional information, and open data on crime and sociodemographic areas in Mexico.

In this document we present an exploratory data analysis of the available information. Banco Azteca (BAZ) has around 12 million savings customers and 800 thousand customers with both credit and savings, from which we have a sample of 1 million savings customers. The analysis in this document is based on this sample and on the whole population of credit customers.

Variable Analysis

Transactions and Amount

quantile  abonos  abonos_monto  retiros  retiros_monto  num_meses  tiempo_meses       freq
0%             0         0.000        0   -39804335.20          1             1  0.0312500
5%             1         1.000        0     -135549.10          1             6  0.0526316
10%            1        50.000        0      -80000.00          1             8  0.0714286
15%            1       150.000        1      -55267.97          2            11  0.0967742
20%            1       554.000        1      -40600.00          2            13  0.1250000
25%            2      1212.975        2      -30724.68          3            14  0.1428571
30%            2      2075.000        2      -23850.00          3            16  0.1666667
35%            3      3200.000        3      -18616.00          4            18  0.2000000
40%            3      4636.000        4      -14750.00          4            19  0.2222222
45%            4      6200.000        5      -11500.00          5            21  0.2500000
50%            5      8350.000        7       -9006.00          5            23  0.2812500
55%            6     10800.000        8       -6940.00          6            25  0.3125000
60%            8     14000.000       10       -5170.00          7            27  0.3333333
65%            9     17850.000       12       -3897.79          8            29  0.3600000
70%           11     22917.640       15       -2650.00          9            30  0.3750000
75%           14     29950.000       19       -1650.00         10            31  0.3846154
80%           18     39550.000       25        -900.00         10            32  0.4210526
85%           24     54000.000       32        -300.00         11            32  0.5000000
90%           33     78500.000       45           0.00         12            32  0.5882353
95%           53    132810.063       72           0.00         12            32  0.7500000
100%       60199  42131739.960    22009           0.00         12            32  1.0000000

Gender

Tenure

Salary based on transactions

Number of months / Total months: a value between 0 and 1. A value of 1 means that the customer had activity in every month available in the data; a value of 0 would mean that no activity took place. A value of 0 is not possible in this database because all customers have at least one transaction.
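As an illustration of this ratio, here is a minimal Python sketch. It assumes the numerator is the number of months with at least one transaction and the denominator is the number of months available for that customer; the function name `activity_ratio` and its argument names are hypothetical and do not come from the project code.

```python
def activity_ratio(active_months, total_months):
    """Share of available months in which the customer had any activity.

    Assumption of this sketch: active_months counts months with at least
    one transaction, total_months counts the months available in the data.
    """
    if total_months <= 0:
        raise ValueError("total_months must be positive")
    if not 0 <= active_months <= total_months:
        raise ValueError("active_months must lie between 0 and total_months")
    return active_months / total_months


# One active month out of 32 available months
print(activity_ratio(1, 32))   # 0.03125
# Activity in every available month
print(activity_ratio(12, 12))  # 1.0
```

Under that assumption, one active month out of 32 gives 0.03125, which is consistent with the smallest value in the freq column of the table above.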

Time of the day

Pending

Average time between transactions

Pending

Frequency of transactions

Pending

Investment Account?

Pending

e-banking account

credit_or_savings  electronic_banking  proporción
credit                              0   0.9509844
credit                              1   0.0490156
savings                             0   0.9553800
savings                             1   0.0446190

credit_or_savings  active_electronic_banking  proporción
credit                                     0   0.9686378
credit                                     1   0.0313622
savings                                    0   0.9772680
savings                                    1   0.0227310

Months with more activity

Geography

We have the customers’ ZIP codes. Combined with publicly available information from sources such as INEGI, this could be used to estimate the socioeconomic level of each savings customer.

Available sources:

AGEB stands for Área GeoEstadística Básica (Basic Geostatistical Area), and a locality is a general term used by CONAPO to define several AGEBs.

This document uses information from the socioeconomic regions defined by INEGI.

ZIP code geographical information is available. According to the official postal code webpage, there are 32,448 different ZIP codes in Mexico, of which around 25,000 are available as shape files.

Problem:

The polygons defining the ZIP codes are not equivalent to the polygons defining the AGEBs, so a mapping between them is needed in order to use the publicly available information.

Possible solutions:

  1. Mapping from centroid to centroid
  2. Polygon convex combination

First approach: mapping from centroid to centroid

Perhaps the simplest solution is to find the centroid of each ZIP code and AGEB, and then just map a given ZIP code to the closest AGEB centroid.
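The nearest-centroid idea can be sketched as follows. This is a minimal, brute-force Python illustration and not the code used for the report: it assumes centroids are given as (longitude, latitude) pairs in degrees, measures distance with the haversine formula on a sphere of radius 6371 km, and uses hypothetical AGEB identifiers (`ageb_a`, `ageb_b`, `ageb_c`). The coordinates are centroid values copied from the mapping table shown later in this section.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; an assumption of this sketch


def haversine_km(lon1, lat1, lon2, lat2):
    """Great-circle distance in km between two (lon, lat) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def nearest_centroid(point, centroids):
    """Return (id, distance_km) of the centroid closest to point.

    point is a (lon, lat) pair; centroids maps an id to a (lon, lat) pair.
    """
    best_id = min(centroids, key=lambda k: haversine_km(*point, *centroids[k]))
    return best_id, haversine_km(*point, *centroids[best_id])


# Hypothetical AGEB ids; coordinates taken from the report's mapping table
ageb_centroids = {
    "ageb_a": (-99.19294, 19.34740),
    "ageb_b": (-99.19487, 19.36071),
    "ageb_c": (-99.18784, 19.35412),
}
zip_centroid = (-99.19317, 19.34607)  # centroid listed for ZIP 1000

ageb_id, distance_km = nearest_centroid(zip_centroid, ageb_centroids)
print(ageb_id, round(distance_km, 3))
```

This search compares every ZIP code centroid with every AGEB centroid, which is simple but slow for the full country; a spatial index would be a natural replacement if speed matters.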

We have a classification for each AGEB that aims to show the differences among AGEBs based on indicators related to housing, education, health and employment, built from the last population census. Each AGEB is assigned to one of 7 strata, such that stratum 7 contains the AGEBs with the most favorable average conditions and stratum 1 contains the AGEBs with the least favorable average conditions.

The following images show maps of Mexico City and its surroundings, Guadalajara and Monterrey.

Map with centroids of each polygon:

Now, the same map for Guadalajara, Jalisco:

And finally, for Monterrey, Nuevo León:

ZIP codes and their centroids can be seen in the following map of Mexico City:

ZIP codes and their centroids can be seen in the following map of Guadalajara. Some of the centroids may not match the plotted polygon perfectly because the database treats the ZIP code and the identifier as different groups.

ZIP codes and their centroids can be seen in the following map of Monterrey:

Finally, plotting the centroids of AGEBs and ZIP codes in Mexico City together, we get:

Guadalajara:

Monterrey:

So, for each available ZIP code, we find the closest AGEB centroid and assign that AGEB to the ZIP code, which gives a table in the following format:

ZIP    ZIP long   ZIP lat  Nearest AGEB  AGEB long  AGEB lat  Distance (km)  Classification
1000  -99.19317  19.34607      9.01e+11  -99.19294  19.34740      0.1495517               7
1010  -99.19406  19.36048      9.01e+11  -99.19487  19.36071      0.0883245               7
1020  -99.18719  19.35680      9.01e+11  -99.18784  19.35412      0.3053818               7
1030  -99.17920  19.35739      9.01e+11  -99.17872  19.35448      0.3273190               7
1040  -99.19157  19.35557      9.01e+11  -99.19429  19.35471      0.3002093               7
1048  -99.19486  19.35644      9.01e+11  -99.19429  19.35471      0.2008129               7
1049  -99.19664  19.35363      9.01e+11  -99.19429  19.35471      0.2750874               7
1050  -99.18245  19.34968      9.01e+11  -99.18462  19.34641      0.4280162               7
1060  -99.19780  19.34902      9.01e+11  -99.19524  19.35028      0.3016385               7
1070  -99.18643  19.34456      9.01e+11  -99.18462  19.34641      0.2800899               7

This approach may fail since, as one can see, ZIP code polygons are generally bigger in area than AGEBs, so the heterogeneity of each ZIP code is being ignored.

Second approach: pending

Customer analysis

First, let’s look at the distribution of the AGEB classification across the country. Recall that stratum 7 means the AGEB is “good” on average and stratum 1 means it is “bad”.

And now, the mapping of the ZIP codes:

The distribution changed drastically. As the following graph shows, the original AGEBs were both urban (U) and rural (R), but the mapping contains only urban ZIP codes, which may explain why the distribution changed so much.

And now let’s analyze the sample with 1 million savings customers.

Out of the 1 million people, we have a ZIP code mapping for `r sum(datos$zip_code %in% mapeo$CP)` of them, distributed as follows:

ATM information

We potentially have information on all the transactions at every ATM, and we have the exact location of each ATM (as can be seen in the following map). The idea is to use this information to estimate the spatial distribution of income.

Crime Rate

Pending